AITopics

Country: North America > United States (0.46)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Nuclear Medicine (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
(2 more...)

Neural Information Processing SystemsJun-19-2026, 08:22:32 GMT

Multi-Modal View Enhanced Large Vision Models for Long-Term Time Series Forecasting

Time series, typically represented as numerical sequences, can also be transformed into images and texts, offering multi-modal views (MMVs) of the same underlying signal. These MMVs can reveal complementary patterns and enable the use of powerful pre-trained large models, such as large vision models (LVMs), for long-term time series forecasting (LTSF). However, as we identified in this work, the state-ofthe-art (SOTA) LVM-based forecaster poses an inductive bias towards "forecasting periods". To harness this bias, we propose DMMV, a novel decomposition-based multi-modal view framework that leverages trend-seasonal decomposition and a novel backcast-residual based adaptive decomposition to integrate MMVs for LTSF. Comparative evaluations against 14 SOTA models across diverse datasets show that DMMV outperforms single-view and existing multi-modal baselines, achieving the best mean squared error (MSE) on 6 out of 8 benchmark datasets. The code for this paper is available at: https://github.com/D2I-Group/dmmv.

data mining, machine learning, natural language, (18 more...)

Country: North America > United States (0.45)

Genre: Research Report > Experimental Study (1.00)

Industry: Health & Medicine > Therapeutic Area (0.67)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Neural Information Processing SystemsFeb-12-2026, 12:55:44 GMT

An Adaptive Deep RL Method for Non-Stationary Environments with Piecewise Stable Context

The inferred belief context can be leveraged to augment the state, leading to a policy that can adapt to abrupt variations in context.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Robots (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Artificial IntelligenceDec-1-2025

Memo: Training Memory-Efficient Embodied Agents with Reinforcement Learning

Gupta, Gunshi, Yadav, Karmesh, Kira, Zsolt, Gal, Yarin, Aljundi, Rahaf

To enable embodied agents to operate effectively over extended timeframes, it is crucial to develop models that form and access memories to stay contextualized in their environment. In the current paradigm of training transformer-based policies for embodied sequential decision-making tasks, visual inputs often overwhelm the context limits of transformers, while humans can maintain and utilize a lifetime of experience compressed as memories. Significant compression is possible in principle, as much of the input is irrelevant and can be abstracted. However, existing approaches predominantly focus on either recurrent models with fixed-size memory or transformers with full-context reliance. In this work, we propose Memo, a transformer-based architecture and training recipe for reinforcement learning (RL) on memory-intensive, long-horizon tasks. Memo incorporates the creation and retrieval of memory by interleaving periodic summarization tokens with the inputs of a model during training. We demonstrate Memo's effectiveness on a gridworld meta-RL benchmark and a multi-object navigation task in photo-realistic indoor settings. Memo outperforms naive long-context transformer baselines while being more compute and storage efficient. Additionally, Memo generalizes better to longer contexts at inference time and remains robust in streaming settings, where historical context must be truncated to fit inference constraints. Our code is available at: https://github.com/gunshi/memo.

machine learning, natural language, reinforcement learning, (18 more...)

2510.19732

Country: Europe (0.93)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
(2 more...)

arXiv.org Artificial IntelligenceNov-3-2025

Multi-Modal View Enhanced Large Vision Models for Long-Term Time Series Forecasting

Shen, ChengAo, Yu, Wenchao, Zhao, Ziming, Song, Dongjin, Cheng, Wei, Chen, Haifeng, Ni, Jingchao

Time series, typically represented as numerical sequences, can also be transformed into images and texts, offering multi-modal views (MMVs) of the same underlying signal. These MMVs can reveal complementary patterns and enable the use of powerful pre-trained large models, such as large vision models (LVMs), for long-term time series forecasting (LTSF). However, as we identified in this work, the state-of-the-art (SOTA) LVM-based forecaster poses an inductive bias towards "forecasting periods". To harness this bias, we propose DMMV, a novel decomposition-based multi-modal view framework that leverages trend-seasonal decomposition and a novel backcast-residual based adaptive decomposition to integrate MMVs for LTSF. Comparative evaluations against 14 SOTA models across diverse datasets show that DMMV outperforms single-view and existing multi-modal baselines, achieving the best mean squared error (MSE) on 6 out of 8 benchmark datasets. The code for this paper is available at: https://github.com/D2I-Group/dmmv.

data mining, large language model, machine learning, (20 more...)

2505.24003

Country: North America > United States (0.45)

Genre: Research Report (0.63)

Industry: Health & Medicine > Therapeutic Area (0.67)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)

Lanzendörfer, Luca A., Grötschla, Florian, Blaser, Cesare, Wattenhofer, Roger

Benchmarking Diarization Models

arXiv.org Artificial IntelligenceOct-1-2025

Speaker diarization is the task of partitioning audio into segments according to speaker identity, answering the question of "who spoke when" in multi-speaker conversation recordings. While diarization is an essential task for many downstream applications, it remains an unsolved problem. Errors in diarization propagate to downstream systems and cause wide-ranging failures. To this end, we examine exact failure modes by evaluating five state-of-the-art diarization models, across four diarization datasets spanning multiple languages and acoustic conditions. The evaluation datasets consist of 196.6 hours of multilingual audio, including English, Mandarin, German, Japanese, and Spanish. Overall, we find that PyannoteAI achieves the best performance at 11.2% DER, while DiariZen provides a competitive open-source alternative at 13.3% DER. When analyzing failure cases, we find that the primary cause of diarization errors stem from missed speech segments followed by speaker confusion, especially in high-speaker count settings.

artificial intelligence, diarization, machine learning, (18 more...)

2509.26177

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.94)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.46)

arXiv.org Artificial IntelligenceSep-4-2025

S2M2ECG: Spatio-temporal bi-directional State Space Model Enabled Multi-branch Mamba for ECG

Zhang, Huaicheng, Wang, Ruoxin, Zhou, Chenlian, Shi, Jiguang, Ge, Yue, Li, Zhoutong, Chang, Sheng, Wang, Hao, He, Jin, Huang, Qijun

As one of the most effective methods for cardiovascular disease (CVD) diagnosis, multi-lead Electrocardiogram (ECG) signals present a characteristic multi-sensor information fusion challenge that has been continuously researched in deep learning domains. Despite the numerous algorithms proposed with different DL architectures, maintaining a balance among performance, computational complexity, and multi-source ECG feature fusion remains challenging. Recently, state space models (SSMs), particularly Mamba, have demonstrated remarkable effectiveness across various fields. Their inherent design for high-efficiency computation and linear complexity makes them particularly suitable for low-dimensional data like ECGs. This work proposes S2M2ECG, an SSM architecture featuring three-level fusion mechanisms: (1) Spatio-temporal bi-directional SSMs with segment tokenization for low-level signal fusion, (2) Intra-lead temporal information fusion with bi-directional scanning to enhance recognition accuracy in both forward and backward directions, (3) Cross-lead feature interaction modules for spatial information fusion. To fully leverage the ECG-specific multi-lead mechanisms inherent in ECG signals, a multi-branch design and lead fusion modules are incorporated, enabling individual analysis of each lead while ensuring seamless integration with others. Experimental results reveal that S2M2ECG achieves superior performance in the rhythmic, morphological, and clinical scenarios. Moreover, its lightweight architecture ensures it has nearly the fewest parameters among existing models, making it highly suitable for efficient inference and convenient deployment. Collectively, S2M2ECG offers a promising alternative that strikes an excellent balance among performance, computational complexity, and ECG-specific characteristics, paving the way for high-performance, lightweight computations in CVD diagnosis.

artificial intelligence, ecg, machine learning, (20 more...)

2509.03066

Country: Asia > China (0.47)

Genre: Research Report > Experimental Study (0.46)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Diagnostic Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing SystemsAug-19-2025, 14:42:52 GMT

An Adaptive Deep RL Method for Non-Stationary Environments with Piecewise Stable Context

The inferred belief context can be leveraged to augment the state, leading to a policy that can adapt to abrupt variations in context.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Robots (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Carbonneau, Marc-André, van Niekerk, Benjamin, Seuté, Hugo, Letendre, Jean-Philippe, Kamper, Herman, Zaïdi, Julian

Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis

arXiv.org Artificial IntelligenceJul-4-2025

Modeling voice identity is challenging due to its multifaceted nature. In generative speech systems, identity is often assessed using automatic speaker verification (ASV) embeddings, designed for discrimination rather than characterizing identity. This paper investigates which aspects of a voice are captured in such representations. We find that widely used ASV embeddings focus mainly on static features like timbre and pitch range, while neglecting dynamic elements such as rhythm. We also identify confounding factors that compromise speaker similarity measurements and suggest mitigation strategies. To address these gaps, we propose U3D, a metric that evaluates speakers' dynamic rhythm patterns. This work contributes to the ongoing challenge of assessing speaker identity consistency in the context of ever-better voice cloning systems. We publicly release our code.

artificial intelligence, machine learning, utterance, (16 more...)

2507.02176

Country:

North America > United States > Pennsylvania (0.04)
North America > Canada > Quebec > Montreal (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
(2 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Leisure & Entertainment > Sports > Track & Field (0.40)
Leisure & Entertainment > Sports > Running (0.40)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.67)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.48)

Lanz, Vojtěch, Hajič, Jan jr

Gregorian melody, modality, and memory: Segmenting chant with Bayesian nonparametrics

arXiv.org Artificial IntelligenceJul-2-2025

The idea that Gregorian melodies are constructed from some vocabulary of segments has long been a part of chant scholarship. This so-called "centonisation" theory has received much musicological criticism, but frequent re-use of certain melodic segments has been observed in chant melodies, and the intractable number of possible segmentations allowed the option that some undiscovered segmentation exists that will yet prove the value of centonisation, and recent empirical results have shown that segmentations can outperform music-theoretical features in mode classification. Inspired by the fact that Gregorian chant was memorised, we search for an optimal unsupervised segmentation of chant melody using nested hierarchical Pitman-Yor language models. The segmentation we find achieves state-of-the-art performance in mode classification. Modeling a monk memorising the melodies from one liturgical manuscript, we then find empirical evidence for the link between mode classification and memory efficiency, and observe more formulaic areas at the beginnings and ends of melodies corresponding to the practical role of modality in performance. However, the resulting segmentations themselves indicate that even such a memory-optimal segmentation is not what is understood as centonisation.

machine learning, natural language, segmentation, (20 more...)

2507.0038

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
North America > United States (0.04)
Asia > Singapore (0.04)
(6 more...)

Genre: Research Report (0.82)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)